Towards The Next Generation Multimedia Presentation


Abstract 


Conventional multimedia productions display a
sequence of frames to the user in a linear or non-linear
way depending on user interactions. Each frame carries
a fixed set of information in the form of media elements
embedded within it. Such a traditional framework,
however, fails to meet many modern-day requirements,
such as the need for frequent content updates,
customizing the presentation according to user
requirements, transmitting the presentation over
low-bandwidth networks and supporting distributed
media components. This paper has been inspired by the
lack of tools to solve all of these challenging problems. It
aims at introducing a new framework for structuring
multimedia presentations by moving the focus from the
concept of frames to the concept of objects. By
representing a multimedia presentation as a set of
meaningful objects, each associated with its own
spatio-temporal parameters, it becomes possible to
customize presentations as per specific user needs and
introduce functionalities like content searching, which
can benefit application areas like distance learning. The
paper also discusses issues related to playback of such a
presentation over a distributed environment. Moreover, it
is seen that such frameworks can be made to support and
utilize emerging standards like MPEG-4, MPEG-7 and
MPEG-21.


1. Introduction


Over the last decade there has been a huge proliferation
in the use of multimedia content throughout the world.
Application areas like digital photo albums,
computer-based training and learning packages, games
and children's entertainment, online business and
corporate presentations, information kiosks and
simulation packages have led to the growth of a large
number of varied presentations. Traditionally these
presentations are created by embedding a fixed set of
media elements within a presentation template,
specifying their spatial and temporal attributes, building
interactive pathways for nonlinear navigation and
compiling the entire collection as executable files with
their own run-time engines. Authoring tools like
Director, Toolbook and Authorware have been used for
such purposes. In recent times, however, people have
begun to look at other fields as possible application
areas of multimedia technology. These areas differ in
various aspects from their traditional counterparts and
present new design and implementation challenges to
developers, which are extremely difficult and
inconvenient to handle using existing methodologies.
Rather, they require a paradigm shift from current
practices and call for new concepts to be incorporated
for adequate solutions.





2. Problem Analysis 


One such situation occurs where media components
need to be frequently changed or updated, possibly to
reflect changes in the physical entity on which the
presentation is modeled, for example a digital art
gallery. In a real-world gallery new specimens might be
added every day or older specimens taken out. Even
for existing artwork, specimens might be regrouped into
different categories based on subject matter or
chronology, descriptive text captions could be changed
to accommodate new information, or newer links
between related artworks could be created. To reflect
the same changes in the presentation, the structure of
the original presentation would need to be modified on
a periodic basis. Since an executable file precludes any
change to its internal structure, the development work
would continuously involve changing source code,
recompiling the entire presentation and redistributing it
to the audience. This investment of time and effort, as
well as the inconvenience of repeated distribution,
would make the system economically and practically
infeasible, while failing to update would leave the
digital version hopelessly outdated with respect to the
real-world entity.


Another situation arises from the fact that the user 
requirements may be too large or varied to be 
envisaged and implemented adequately in a single 
fixed presentation. Let us consider a scenario where a 
multimedia presentation is being used to model a 
digital museum. Traditionally a menu structure is used 
to navigate to different sections of a presentation as 
determined by the author. However choices in the 
menu should also reflect the preferences of different 
users for navigating through the presentation. For 
example one user might want to view rock samples 
from a specific geological era, while another might 
want to view rock samples found in a specified 


rexpagation 


A Journal of Science Communication 





geographical region, while a third might want to 
view rock samples containing a particular ore of iron. 
It could easily be appreciated that the large number 
of potential choices could hardly be satisfied using 
rigid hyperlinks. Moreover it is entirely possible that 
the same page may be part of the result set of all the 
above three queries. Existing authoring tools do not 
support creation of dynamic links to reflect varying 
user choices. Such variety would also be present in
other related scenarios like presentations linked to a
digital audio library where users can ask for
playback of a certain composition by specifying
artists, music types, chronological period etc., or a
video-on-demand system where users might have a
wide range of choices pertaining to
actors/actresses, director, music director,
chronological period, genre, awards etc.


Computer systems nowadays are rarely stand-alone
and are usually linked to local or global networks.
Traditionally multimedia presentations have been
developed for playback on a single computer. They
may exhibit inconsistent behaviour when accessed
by multiple users from different hosts on the
network. To cater to modern computing needs and
utilize the advantages of networking, the
architecture of multimedia presentations would need
to be modified so that they can function in multi-user
environments, with different users browsing different
pages and activating different links simultaneously
from different geographical locations. The issue of
where the presentation itself should reside is also
important from the viewpoint of efficiency and
performance. Media components embedded within a
presentation increase its size to the order of
hundreds of megabytes. If the presentation is to work
in a networked environment, size becomes a crucial
factor, especially for bandwidth-sensitive networks.
In some situations the presentation may need to be
populated with distributed media components, e.g.
teachers and experts may create content for
e-learning packages from geographically dispersed
locations. In such scenarios, the issues which need to
be considered are, on one hand, how the presentation
would maintain its structural integrity, and on the
other, how the spatio-temporal properties of media
components remain intact as they are fetched over
networks with potentially varying transmission
speeds and delays.





3. Requirement Analysis 


In this section we attempt to build up a set of 
requirement factors that would adequately address the 
problems discussed in the previous section. 


3.1 Traditional view of multimedia presentations


Traditionally a multimedia presentation is viewed as a
single executable file containing its own run-time
engine, and within which all the media components are
embedded. Each of the components is associated with
a set of spatio-temporal parameters which determine
how and when it appears within the presentation. A set
of event handlers is present for accepting user and
system initiated events and allowing the user to
navigate in a nonlinear fashion. A multimedia
presentation MMP is therefore represented as a 4-tuple
MMP = {M, P, E, R}, as shown in Fig. 1, where M
denotes the set of all media elements, P the set of
spatio-temporal parameters, E the set of event handlers
and R the run-time commands, also called the run-time
engine, which is contained within and inseparable
from the presentation.


Fig. 1 : Traditional view of multimedia presentation.
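
The 4-tuple view can be made concrete with a small data structure. The sketch below is only an illustration under our own assumptions: the class name `TraditionalMMP`, its field names and the sample elements are hypothetical, and event handlers are reduced to a simple mapping from an event name to the element to be shown next.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TraditionalMMP:
    """Monolithic presentation: M, P and E bundled with the engine R."""
    media: Dict[str, bytes]              # M: media elements embedded as raw bytes
    params: Dict[str, Dict[str, float]]  # P: spatio-temporal parameters per element
    handlers: Dict[str, str]             # E: event name -> element to jump to

    def run(self, start: str, events: List[str]) -> List[str]:
        """R: built-in engine; returns the order in which elements are shown."""
        shown = [start]
        for event in events:
            target = self.handlers.get(event)
            if target in self.media:     # unknown events are ignored
                shown.append(target)
        return shown

    def size_in_bytes(self) -> int:
        """Every embedded element adds to the size of the single file."""
        return sum(len(data) for data in self.media.values())


# Hypothetical two-element presentation with one navigation link.
mmp = TraditionalMMP(
    media={"intro": b"\x00" * 10, "map": b"\x00" * 20},
    params={"intro": {"x": 0, "y": 0, "start": 0.0, "duration": 5.0},
            "map": {"x": 40, "y": 30, "start": 5.0, "duration": 8.0}},
    handlers={"click_map": "map"},
)
```

Because the media bytes live inside the object, changing any element means rebuilding the whole structure, which mirrors the recompilation problem described in section 2.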


3.2 Separation of media 


In order to support frequently updated media 
components and also to enable them to be searched for, 
based on user defined criteria, the first modification 
which needs to be done to the traditional structure is the 
separation of the media components from the rest of 
the presentation. This situation is depicted in Fig. 2. 
The set of media components M is now separated from 
the multimedia presentation MMP, and stored in an 
external media repository MRP. For the time being let 
us consider MRP to be resident in the same local 
machine where MMP is executed and ignore the delay 
associated with data flow between them. The set of
media elements M is referred to from MMP via a set of
locational parameters, typically pathnames, indicated
by the symbolic link L1. Data actually flows from MRP
to MMP during playback via the physical link L2. A
distinct advantage of this scenario is the reduction in
the size of MMP due to the separation of M into MRP.
This translates into a lower startup delay during the
execution of MMP, as media elements are pulled into
MMP as and when they are required, instead of all
being loaded into memory at the same time. Another
major advantage is that it would enable developers to
change content without modifying the presentation,
which would be especially helpful for applications
requiring frequent updates of media content. New
content material could just be loaded into the MRP when
required, overwriting the old content, and the next time
the presentation runs, it would automatically reflect the
latest updates. A third benefit would be the ability to
modify the structure of MRP independently of MMP.


Fig. 2 : Separation of media.
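
A minimal sketch of this separation is given below. It assumes, purely for illustration, that the repository is an in-memory dictionary keyed by pathname; the class names `MediaRepository` and `Presentation` and the sample path are hypothetical and are not part of the framework itself.

```python
from typing import Dict, List


class MediaRepository:
    """MRP: external store of media, keyed by pathname (in-memory stand-in)."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def store(self, path: str, data: bytes) -> None:
        # Storing under an existing path overwrites the old content.
        self._files[path] = data

    def fetch(self, path: str) -> bytes:
        return self._files[path]


class Presentation:
    """MMP without M: it holds only pathnames, never the media bytes."""

    def __init__(self, paths: List[str]) -> None:
        self.paths = paths

    def play(self, mrp: MediaRepository) -> List[bytes]:
        # Media is pulled from the repository only at playback time.
        return [mrp.fetch(path) for path in self.paths]


mrp = MediaRepository()
mrp.store("gallery/painting1.jpg", b"old image")
show = Presentation(["gallery/painting1.jpg"])
first_run = show.play(mrp)

# Updating the repository changes the next playback without touching MMP.
mrp.store("gallery/painting1.jpg", b"new image")
second_run = show.play(mrp)
```

The presentation object is never modified between the two runs, yet the second playback reflects the updated content.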


3.3 Separation of run-time engine 


P and E describe how the various media components
appear and disappear within the presentation, and are
expressed as a set of instructions, typically using
some internal language supported by the application
software used for creating MMP, e.g. Lingo for
Macromedia Director. These instructions are then
interpreted by the run-time engine R, which also accepts
the media components from MRP and finally plays
them back. R represents a fixed command set, which takes
as input the set of media components and the instructions,
and produces as output the final presentation. Thus R may
be separated from MMP into an external component
called the playback module PBM, as shown in Fig. 3. MMP
remains connected to MRP through link L1 and to PBM
via link L3 for transmitting P and E. The media
components now actually flow to PBM via link L2, where
they get displayed. PBM would be a fixed module for a
particular class of presentation and could stay resident on
the target machine beforehand, thereby avoiding the
overhead of transporting it along with MMP. This would
be more crucial when network environments are
considered. The separation would also mean that PBM
could be optimized for efficiency independently of MMP.


Fig. 3 : Separation of run-time engine.


3.4 Scripting Representation 


MMP has till now been considered as an executable file
which needs to be compiled for playback. With the
separation of the run-time module, MMP is left with P
and E, both of which can be expressed as instruction
sets.


In order to address the problem of frequent and quick
modifications of multimedia presentations, it is
proposed that MMP be represented using a scripting
language instead of an executable file. The script
representation would preserve the compact size of the
presentation and would enable the developer to update
the design and layout frequently, without the need to go
through a time-consuming recompilation phase every
time. This would help to reduce development and
content creation time. Also, a script file, being
essentially textual in nature, could be edited easily
using text editors, thereby avoiding the need for
specialized software. This might be beneficial from the
economic viewpoint. The PBM in this case would need
to function as an interpreter of the script in the MMP
and would contain interpreter functions I in addition to
R. This is depicted in Fig. 4. A typical example of such a
script language would be SMIL by W3C.


Fig. 4 : Scripting representation.
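
To illustrate how a playback module might interpret such a script, the sketch below parses a small SMIL-like fragment and derives a playback timeline. It is a simplified illustration, not a conforming SMIL player: only a sequential `seq` container with `src` and `dur` attributes expressed in whole seconds is handled, and the file names and the `schedule` function are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# A textual presentation script; editing it requires no recompilation.
SCRIPT = """
<smil>
  <body>
    <seq>
      <img src="gallery/painting1.jpg" dur="5s"/>
      <audio src="gallery/commentary1.wav" dur="8s"/>
      <img src="gallery/painting2.jpg" dur="4s"/>
    </seq>
  </body>
</smil>
"""


def schedule(script: str) -> List[Tuple[str, float, float]]:
    """Interpret the script: return (source, start, end) for each element."""
    root = ET.fromstring(script)
    sequence = root.find("./body/seq")
    timeline = []
    clock = 0.0
    for element in sequence:
        # Only durations of the form "<number>s" are supported in this sketch.
        duration = float(element.get("dur").rstrip("s"))
        timeline.append((element.get("src"), clock, clock + duration))
        clock += duration   # elements of a sequence play one after another
    return timeline


timeline = schedule(SCRIPT)
```

Changing a duration or reordering the elements is a text edit to `SCRIPT`; the interpreter itself stays fixed, which is the role envisaged for the PBM.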


3.5 Media Searching and Retrieval 


Till now a fixed set of instructions within the MMP 
locate the actual media components M within MRP 
and cause them to be played by the PBM. This leads to 
a static structure essentially decided by the developer. 
We have however seen above that a dynamic 
presentation structure is desirable, where users can 
themselves search for and retrieve media components 
according to their customized needs. The first step for 
this important functionality was already achieved 
when media components were separated from the 
presentation and therefore retain their individual 
entities within the MRP. This advantage could be 
exploited to enable the users to search for and retrieve 
specific elements from the MRP and dynamically 
insert them within MMP. We propose the inclusion of a
multimedia database MDB into the architecture to
handle the queries generated by the user (Fig. 5). We
discuss the structure of the MDB in section 4; for
the time being we assume that features of the media files
kept in the MRP are extracted by an extraction function Y
and kept in the MDB. This feature information serves
to characterize the media components and flows from
the MRP to the MDB via link L5. The features, however,
remain linked to the actual files through symbolic
references L1. Queries can be posted using query
interface functions Q in the PBM and are transmitted
via link L4 to the MDB, where they are interpreted and
a result list is generated by a searching function S. Q
can serve as a parser to check the validity of the query
syntax. The result list would typically be references to a
set of media elements in the MRP which match the
criteria specified in the query. The symbolic
references between the result list in the MDB and M are
indicated by the same link L1. The result list is displayed
back to the PBM using link L4, whereupon the user can
further modify the result list by picking one or more
specific items from it or generating additional queries.
After the final modification of the result list, the selected
item names are passed to the MMP via link L6. MMP
does not contain the actual names of the media elements,
but only place-holders which are subsequently filled
from the MDB via L6. At the same time, the actual media
items referenced are sent from the MRP to the PBM via
link L2 to be displayed or played within the presentation.
To actually play those elements the PBM would also need
the P and E information from the MMP, using link L3.


Fig. 5 : Media searching and retrieval.
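
The query path can be sketched as follows. This is an illustration under stated assumptions only: the extraction function Y is replaced by a stand-in that copies hand-supplied attributes, the searching function S performs exact attribute matching, and the class, function and file names as well as the attribute values are hypothetical.

```python
from typing import Dict, List


def extract_features(path: str, attributes: Dict[str, str]) -> Dict[str, str]:
    """Y: stand-in extractor; a real one would analyse the media bytes."""
    record = dict(attributes)
    record["path"] = path        # symbolic reference back to the file in the MRP
    return record


class MediaDatabase:
    """MDB: holds feature records, not the media files themselves."""

    def __init__(self) -> None:
        self.records: List[Dict[str, str]] = []

    def populate(self, path: str, attributes: Dict[str, str]) -> None:
        self.records.append(extract_features(path, attributes))

    def search(self, **criteria: str) -> List[str]:
        """S: return the paths of all records matching every criterion."""
        return [record["path"] for record in self.records
                if all(record.get(key) == value
                       for key, value in criteria.items())]


def fill_placeholders(template: List[str], selected: List[str]) -> List[str]:
    """Fill the '?' slots of a presentation with the user's selected items."""
    remaining = iter(selected)
    return [next(remaining, slot) if slot == "?" else slot for slot in template]


mdb = MediaDatabase()
mdb.populate("rocks/sample1.jpg", {"era": "era-A", "region": "region-X", "ore": "iron"})
mdb.populate("rocks/sample2.jpg", {"era": "era-B", "region": "region-X", "ore": "none"})

by_era = mdb.search(era="era-A")
by_region = mdb.search(region="region-X")
by_ore = mdb.search(ore="iron")
page = fill_placeholders(["title.png", "?", "?"], by_ore)
```

Here `rocks/sample1.jpg` appears in the result list of all three queries, which is the situation that rigid hyperlinks cannot express conveniently.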


3.6 Distributed media 








Until now we have been assuming that the media 
components are resident on a local system. In a general 
scenario these components might be distributed over 
an accessible network including the Internet. 
Referencing the files from the MDB could be done 
through fully qualified domain names (FQDN). 
Population of the database is done off-line so network 
delays would not be crucial during this phase. 
However during playback of the presentation, network 
delays can assume important proportions as the media 
files flow from the now distributed MRP to the PBM. 
This can increase the skew to a large extent so that 
synchronization constraints may need to be imposed. 
The playback issues are discussed in detail in section 5. 


4. Storage and Retrieval


The inclusion of an MDB means that the database
needs to be populated before it can be utilized. This is
typically done off-line by a supervisory system before
it can handle on-line user queries. A multimedia
database can be defined as a data structure which
enables searching and retrieval of textual as well as
non-textual media like image, audio and video.
Traditionally relational databases have functioned by
using pattern matching algorithms on textual strings to
find query results. A media component like an image or
audio clip essentially consists of raw data bytes with
very little inherent meaning or useful information within
it which may be used for query matching.


4.1 Textual annotations 


The earliest approaches to media searching therefore 
relied on textual annotations manually generated by 
individuals and attached to the media component, 
thereby trying to adapt existing methodologies to a 
new application. However major drawbacks of this 
approach were soon apparent. First, the subjective 
nature of the manual descriptions precluded its 
universal application to media searching: the same 
object could be referred to using different textual 
terms, while in other cases the same term was used to 
refer to different objects. Secondly, because of human
intervention the process required a huge investment of
effort and time, which was not practicable for large
databases. Thirdly, it could be used only when textual
descriptions of the item were available, which was not
always the case.


4.2 Low-level features 


To overcome these limitations feature extraction 
algorithms were developed which could extract low 
level features from media components in an 
unsupervised or semi-supervised manner. Apart from 
the fact that these required minimal human 
intervention and so were much faster than manual 
methods, they also overcame the limitations of 
subjective manual descriptions and so were much more 
reliable. They also overcame the limitation that queries 
need to be expressed in textual terms. A query could 
now in principle be another media element from which 
low-level features could be extracted and then 
compared with similar features previously extracted 
from media components kept in the database. The 
result list could then be generated based on the 
similarity between the query media features and 
database media features. These techniques were not 
without their own problems. Firstly, the features
extracted could not generally be expressed as text
strings but required multi-dimensional mathematical
vectors to represent them. Existing database structures
could not efficiently handle such vectors, and newer
database architectures had to be developed.
In general it was felt that object-oriented databases
were more efficient in handling such data structures
than relational databases. Secondly, relational
techniques were based on exact matching of text
strings, but non-textual media did not usually require
exact matches. A concept of “similarity” needed
to be defined, under which even a partial match between
media components would qualify them as similar.
The concept of similarity was, however, mostly
founded on human perception and was difficult to
quantify. The fact that judgments of similarity varied
widely between individuals complicated matters
further. Thirdly, computing each and every feature
required large resources and was usually not feasible in
practice. Methods for representing the entire
information using smaller, compact data sets
became crucial.


A wide variety of algorithms has been developed to
handle feature extraction. For images one of the most
widely used features is color, either as individual pixel
values or as histogram plots computed over partitions
in the image. An alternative approach called
color-signature represents color as a string of bits
instead of a histogram. Another approach is to isolate
clusters of colors from images and use the amount of
overlap between them to compute similarity. Because
the RGB color space is not perceptually uniform, the use
of a proper color space and color quantization scheme
has been emphasized in all color-based retrieval
systems. In most content-based retrieval (CBR) systems,
the HSI (or its variant HSV) color space is used to
represent and compare color values in images because of
its perceptual similarity with the human visual system
and also because of its invertible transformation
relations with the RGB space. Other than color, shape
information has also been used to capture low-level
features. Shapes of objects in an image have been
mapped onto a grid of cells, and a string of bits is used
to represent them by assigning 1 to cells covered by the
shape and 0 to other cells. The turning angles method
uses the angles of counter-clockwise tangents to
describe the boundary of a shape. The centroid-radii
approach uses the lengths of a number of radii drawn
from the centroid to the periphery of the shape at equal
angle separations to describe the shape. Texture is
another feature which has been used to compare
images. Textures are distinguished as irradiance
patterns containing a limited range of spatial
frequencies and orientations, usually identified by
Gabor filters.
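
As an illustration of a color feature, the sketch below builds a normalized hue histogram in HSV space and compares two images by histogram intersection. Both the choice of 8 hue bins and the use of histogram intersection as the similarity measure are our own assumptions for the example; images are represented simply as lists of 8-bit RGB tuples, and partitioning of the image is omitted.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB


def hue_histogram(pixels: Sequence[Pixel], bins: int = 8) -> List[float]:
    """Quantize the hue of every pixel into `bins` cells and normalize."""
    counts = [0] * bins
    for r, g, b in pixels:
        hue, _sat, _val = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        counts[min(int(hue * bins), bins - 1)] += 1
    return [count / len(pixels) for count in counts]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]: 1 for identical histograms, 0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


red_image = [(255, 0, 0)] * 4
cyan_image = [(0, 255, 255)] * 4
mixed_image = [(255, 0, 0)] * 2 + [(0, 255, 255)] * 2

same = histogram_intersection(hue_histogram(red_image), hue_histogram(red_image))
disjoint = histogram_intersection(hue_histogram(red_image), hue_histogram(cyan_image))
partial = histogram_intersection(hue_histogram(mixed_image), hue_histogram(red_image))
```

Note that gray pixels have no meaningful hue and all fall into the first bin here; a practical system would also use saturation and value.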








For audio the first step is to categorize clips as either
speech or music. The features that can be used for
discriminating them include average energy, silence
ratio and zero crossing rate in the time domain, and
pitch and spectral centroid in the frequency domain.
Each of these features may be used individually in
different classification steps, or a set of features can be
used together as a vector. One approach to speech
recognition is to compare waveform patterns
corresponding to phonemes, the smallest units of
pronunciation, with segments within the audio clip in
order to identify the words spoken. Music can either be
sample-based, e.g. WAV files, or structured, e.g. MIDI
files. There are two general approaches to comparing
sample-based music. In the first approach a set of
features is extracted from the audio clip and
represented as a vector, which is then used for
comparisons. These features may vary over time and
would need to be computed for each frame. In one such
system five features, namely loudness, pitch,
brightness, bandwidth and harmonicity, have been
used, with each feature being represented statistically
by three parameters: mean, variance and
autocorrelation. In the second approach the pitch of
each note is extracted and represented as a string of
symbols. The retrieval decision is based on the
similarity between query and candidate strings. In
structured music the notes are already present within
the file as a set of instructions. These instruction sets are
used to compare MIDI sequences for similarity. MPEG-4
Structured Audio is a new standard in this direction,
which represents sound using algorithms and control
languages.
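
The three time-domain features named above can be computed directly from the sample values. The following is a minimal sketch using common textbook definitions, which we assume here since the text does not fix them; the silence threshold is an arbitrary illustrative parameter.

```python
from typing import Sequence


def average_energy(samples: Sequence[float]) -> float:
    """Mean of the squared sample amplitudes."""
    return sum(x * x for x in samples) / len(samples)


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


def silence_ratio(samples: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose amplitude stays below the threshold."""
    return sum(1 for x in samples if abs(x) < threshold) / len(samples)


alternating = [1.0, -1.0, 1.0, -1.0]   # changes sign at every step
half_silent = [0.0, 0.0, 1.0, 1.0]     # first half is silent
```

The resulting values can be used one at a time in successive classification steps or concatenated into a single feature vector, as described in the text.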








For video the feature extraction process involves the
identification of individual frames. Frames are then
grouped into shots, each of which represents a
continuous action in time and space or a camera motion
and therefore contains similar low-level features. One
promising approach to shot detection is to identify
abrupt scene cuts as well as gradual transitions. Most of
the approaches working on uncompressed video use
frame difference as a measure for shot boundary
detection. Frame differences are usually derived by
partitioning and computing local histograms. To
reduce computational loads frame sizes can be reduced
to iconic sizes, e.g. 16×16 pixels. Gradual transitions
are detected by using a moving query window on either
side of a current frame and computing the ratio of
differences between pre-frames and post-frames.
Approaches working with compressed video use one or
more features of the encoding algorithm such as DCT
coefficients, macro-blocks and motion vectors.
Once the shots are identified the next task is to group
semantically similar shots into scenes. The
methodology followed by a majority of researchers is
to identify certain keyframes within each shot, also
called representative frames or r-frames, that best
depict the content of that shot, and then compare
r-frames of different shots to gauge similarity between
them. Each r-frame is represented by a set of static and
dynamic features for comparisons. Static descriptors
include typical image features like color, texture and
shape. Dynamic features are based on motion
descriptors derived from optical flow analysis. The
calculations are based on brightness constraint
equations proposed by Horn and Schunck, which
express the fact that under constant scene illumination
the brightness of the same physical point should
remain constant despite camera motion.
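
Detection of abrupt cuts from frame differences can be sketched as below. The sketch makes several simplifying assumptions of our own: frames are flat lists of 8-bit gray values, a single global histogram per frame replaces the local histograms over partitions, the number of bins and the threshold are arbitrary, and gradual transitions are not handled.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 4, levels: int = 256) -> List[float]:
    """Normalized histogram of the gray values in one frame."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // levels] += 1
    return [count / len(frame) for count in counts]


def detect_cuts(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Return indices of frames that start a new shot (abrupt cuts only)."""
    cuts = []
    previous = gray_histogram(frames[0])
    for index in range(1, len(frames)):
        current = gray_histogram(frames[index])
        difference = sum(abs(a - b) for a, b in zip(previous, current))
        if difference > threshold:
            cuts.append(index)
        previous = current
    return cuts


dark = [10] * 16      # a tiny "iconic" frame of 16 dark pixels
bright = [200] * 16   # a frame of 16 bright pixels
cuts = detect_cuts([dark, dark, bright, bright, dark])
```

In this toy sequence the histogram difference is zero within a shot and maximal across a cut, so frames 2 and 4 are reported as shot boundaries.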


4.3 Identifying semantic objects 


In the context of CBR systems humans tend to pose 
queries using the semantic nature of media contents. 
Systems based on low-level features would not be very 
effective in those situations because of the gap 
between how the media content is interpreted by the 
humans and the machines. To be truly effective a CBR 
system would need to understand the semantic nature 
of media content and tune its retrieval engines 
according to the interpretation of these features. A 
semantic object (SO) is a collection of image pixels or 
audio samples that corresponds to the projection of a 
real object or event in an image, audio or video 
sequence, e.g. a cluster of trees, a car moving along a
road, a laugh or a scream. To extract such objects
automatically a clear characterization of such objects
is required. Unfortunately, since semantic objects are
human abstractions, a unique definition does not exist.
In addition, since semantic objects cannot generally be
characterized by simple homogeneity criteria (e.g.
uniform color or uniform motion), their extraction is in
general extremely difficult, and only in recent times
has this problem been addressed by researchers. One
approach is to use high-level descriptions instead of
low-level features in the matching process. How to
extract high-level descriptions from media elements
and bridge the gap between low-level features and
human understanding of the media contents is a
critical issue. One promising technique to solve this
problem is to describe the media content with a
hierarchical structure to achieve progressive content
analysis. The contents of the media can be
represented at different levels, such as a 3-level
representation (feature level, object level, scene
level) or a 5-level representation. To fill the semantic
gap, the system extracts low-level features while the
user supplies more high-level knowledge, and the
system tries to integrate the two to arrive at a more
complete description. Among related
techniques, relevance feedback has received a lot of

















attention because it can combine the information from 
the user with the automatically extracted features. 
Many methods have been proposed to reach the goal of 
relevance feedback. According to the amount of
human intervention required we can classify the
extraction process as manual, semi-automatic
(supervised) or automatic (unsupervised). In the case of
manual extraction the rules for extraction of semantic
video objects are directly applied by the user. The
method enables high accuracy in identifying spatial and
temporal boundaries but is very time-consuming,
especially for large video databases. A manual
extraction procedure may be followed in some cases
requiring a high-quality production or to benchmark
automatic or semi-automatic techniques. Fully
automatic extraction procedures are still in their infancy
because translating properties of SOs into extraction
criteria is difficult. For this reason many semi-automatic
extraction procedures have been proposed as a trade-off
between fully automatic and manual extraction
strategies. The procedure involves the interaction of the
user during some stage of the extraction process.
Usually the user provides initial information about video
segmentation, after which a tracking mechanism follows
its temporal evolution in the subsequent frames. The
object boundaries may need to be modified and updated
either via further interaction or according to some
low-level homogeneity criteria. Fully automatic
extraction methods apply the rules defining semantic
objects in an algorithmic way. These rules are based on
special characteristics of the scene or on specific
knowledge (a priori information). They are derived for
a specific class of applications. An example of such
applications is chroma-keying, employing a background
of known color (usually blue or green) and extracting
objects by discarding background pixels. In other
approaches, some knowledge of the objects to be
extracted is utilized. An example is a face detection
algorithm, which makes use of the shape and location of
various facial features to extract information. In other
cases motion information can be used to extract moving
objects from a scene. Motion estimation aims at
characterizing apparent motion by a displacement
vector or an optical flow field. A displacement vector
describes the displacement of a pixel between two
frames of the video. The optical flow is the distribution
of the apparent velocities that can be associated with the
apparent motion. Estimating motion is a difficult task
as it is highly sensitive to noise and variation in scene
illumination. Another problem in motion estimation is
referred to as the occlusion problem. This stems from
the fact that the object in question may be occluded by
other objects in some of the frames, thereby producing a
gap in its motion vectors over a period of time. To cover
the gap, assumptions are needed, which are usually
smoothness constraints on the optical flow field to
achieve continuity. Reviews of the different techniques
and comparative tests for performance evaluation have
been reported in the literature.
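
Of the automatic methods mentioned, chroma-keying is the simplest to sketch. The example below marks a pixel as foreground when its RGB value lies further than a tolerance from the known key color. The Euclidean distance in RGB space, the pure green key and the tolerance of 60 are illustrative assumptions of ours rather than values taken from any particular system.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB


def chroma_key_mask(pixels: Sequence[Pixel],
                    key: Pixel = (0, 255, 0),
                    tolerance: float = 60.0) -> List[bool]:
    """True for object pixels, False for pixels close to the key color."""
    return [math.dist(pixel, key) > tolerance for pixel in pixels]


def extract_object(pixels: Sequence[Pixel], mask: Sequence[bool]) -> List[Pixel]:
    """Discard background pixels and keep only the object pixels."""
    return [pixel for pixel, keep in zip(pixels, mask) if keep]


scene = [(0, 255, 0), (10, 250, 5), (200, 30, 40)]  # two green, one reddish
mask = chroma_key_mask(scene)
foreground = extract_object(scene, mask)
```

The method works only because the background color is known in advance, which is the kind of a priori information on which fully automatic extraction relies.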


4.4 Similarity Metrics 


Unlike traditional databases, most of the queries in 
multimedia databases are based on similarity. For 
example, “Find all images that are similar to a given 
image, within a user-specified range” or “Given an 
image, find the five most similar images”. The 
similarity search problem therefore becomes a nearest 
neighbour search problem in the feature space. 
According to the elements considered, distance 
measures may be classified as point to point, set-to-set 
and point-to-set. A point-to-point distance vector 
computes the proximity between two feature vectors. 
Let $S = (s_1, s_2, \ldots, s_n)$ and $T = (t_1, t_2, \ldots, t_n)$ be two
feature vectors in n dimensional feature space. Then
the difference diff(S,T) between S and T using an $L_p$
metric is as follows:

$$\mathrm{diff}(S,T) = \left( \sum_{i=1}^{n} |s_i - t_i|^p \right)^{1/p}$$

where p is the order of the metric. For p=1 it becomes
the Manhattan distance,

$$\mathrm{diff}(S,T) = \sum_{i=1}^{n} |s_i - t_i|$$

while for p=2 it becomes the Euclidean distance,

$$\mathrm{diff}(S,T) = \sqrt{\sum_{i=1}^{n} (s_i - t_i)^2}$$

For $p \to \infty$ the distance is defined by:

$$\mathrm{diff}(S,T) = \max_{1 \le i \le n} |s_i - t_i|$$


A set-to-set distance computes the proximity of two sets
$C_1$ and $C_2$, for which several strategies may be adopted.
For example it may be computed as a function of the
point-to-point distances of the points belonging to the
two sets. The minimum, the maximum or the average
distance may then be chosen. The computational
complexity of this method is high and increases with
the cardinality of the sets. A simpler solution is to
choose a representative of each set and then to compute
the point-to-point distance between them. To account
for the cardinality of the sets, a normalization
procedure may be introduced. A point-to-set distance
computes the proximity between a feature point S and a
set C. An example is considering the point-to-point
distances from S to all the elements in C and then
taking the minimum, maximum or average. Such
a method is again computationally heavy if the
cardinality of C is high. A simpler solution is to take a
representative value of C and then to compute the
distance. Point-to-set distances are typically used in
region growing techniques.
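
The distance measures of this section can be illustrated with a minimal Python sketch. The function names (lp_distance, point_to_set, set_to_set, centroid) are our own illustrative choices, and the centroid is only one possible choice of set representative.

```python
import math
from statistics import mean


def lp_distance(s, t, p=2.0):
    """L_p distance between two equal-length feature vectors."""
    if len(s) != len(t):
        raise ValueError("feature vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(s, t)]
    if math.isinf(p):
        return max(diffs)  # limit p -> infinity: largest coordinate difference
    return sum(d ** p for d in diffs) ** (1.0 / p)


def point_to_set(s, c, p=2.0, reduce=min):
    """Reduce (min, max or mean) over the distances from point s to every element of c."""
    return reduce(lp_distance(s, x, p) for x in c)


def set_to_set(c1, c2, p=2.0, reduce=min):
    """Reduce over all pairwise distances; the cost grows with len(c1) * len(c2)."""
    return reduce(lp_distance(a, b, p) for a in c1 for b in c2)


def centroid(c):
    """One possible representative of a set: the coordinate-wise mean (an assumption)."""
    return [mean(column) for column in zip(*c)]


if __name__ == "__main__":
    S, T = [0, 0], [3, 4]
    print(lp_distance(S, T, 1), lp_distance(S, T, 2), lp_distance(S, T, math.inf))
```

Comparing the centroids of two sets with lp_distance needs a single distance computation, which is the cheaper alternative to the exhaustive pairwise strategy described above.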


4.5 Indexing Mechanisms


A similarity metric enables us to compute
similarity between vectors using distance as a
measure. However, since multimedia data is
potentially computation intensive, efficient
calculations require indexing mechanisms. If the
MDB is small, of the order of hundreds of
objects, a sequential scan of the entire database may
be adequate. However, as the database size grows
the sequential scan is no longer a reasonable
arrangement, especially if similarity computations
are non-trivial. What is needed is a way to filter out
objects that are not relevant to the query without
dismissing relevant objects. In the previous
section we saw that our concept of similarity can be
modeled by means of a distance function in a
suitable metric space. In order to deal with such
queries an access method should implement a
clustering method, for grouping together similar
objects, and a way to represent such clusters for
indexing purposes. This means that classical
tabular structures like B-trees are not suitable. To
support spatial search operations, multi-dimensional
or spatial access methods (SAM) are
required. In recent times a plethora of such
methods have been proposed. The search
operations that concern us most are those of range
and nearest neighbour queries. A range query
defines a region of the space centered on the query
value, whose shape depends on the metric used.
The R-tree is a multi-dimensional generalization
of the B-tree and may be used to store rectangular
regions of an image. Because disk accesses are
often slow, R-trees provide a convenient way of
minimizing the number of disk accesses. Each R-tree
has an associated order, which is an integer K.
Each non-leaf R-tree node contains a set of at most
K rectangles and at least K/2 rectangles. This
feature makes R-trees appropriate for disk based
retrieval because each disk access brings back a
page containing several, i.e. at least K/2,
rectangles. One drawback of the R-tree stems from
the fact that, since node regions may overlap, even
an exact match point query may lead to multiple
search paths. Variants of the R-tree have been
proposed to improve searching efficiency. For
storing higher dimensional point data, the
KD-tree has been proposed. Here each node
represents a point in a k-dimensional space and is
associated with an array of length k. The answer to a
range query is the set of all points within a specified
distance of the query point.
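
As an illustration of how an index avoids a full sequential scan, the following is a minimal in-memory KD-tree sketch with a Euclidean range query. It is a simplified teaching example under our own assumptions (names such as KDNode, build_kdtree and range_query are hypothetical, and no disk paging or balancing under updates is modeled).

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class KDNode:
    point: Sequence[float]          # the k-dimensional point stored at this node
    axis: int                       # coordinate used to split the space here
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kdtree(points, depth=0):
    """Build a KD-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def range_query(node, query, radius, found=None):
    """Collect all stored points within `radius` (Euclidean) of `query`."""
    if found is None:
        found = []
    if node is None:
        return found
    if math.dist(node.point, query) <= radius:
        found.append(node.point)
    delta = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if delta < 0 else (node.right, node.left)
    range_query(near, query, radius, found)
    # The far side can only contain matches if the splitting plane is within reach.
    if abs(delta) <= radius:
        range_query(far, query, radius, found)
    return found


if __name__ == "__main__":
    pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
    print(sorted(range_query(build_kdtree(pts), (6, 3), 2.5)))
```

The pruning test on the splitting plane is what filters out non-relevant subtrees without dismissing relevant points.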


4.6 Semantic Interpretation 


Extraction and representation of low-level features
and semantic objects is not sufficient to
appropriately handle user queries because the
system still cannot interpret what the features or
objects mean. Humans are most likely to pose queries
mentioning semantic concepts like people, trees,
road, car and events like running, laughing,
explosion etc. The gap between recognizing a
pattern of features and knowing what it means is to
be bridged by feeding knowledge to the system.
Based on this knowledge, the system should not only
be able to interpret the meaning of the objects but
also to learn and make deductions about related matters. We
propose the inclusion of a knowledgebase into the
architecture to accomplish these tasks. Only in very
recent times have researchers begun looking into the
task of automatic semantic interpretation. In one approach,
Bayesian networks have been used to characterize
multimedia semantic features in three categories:
objects (man, car), sites (indoor, beach) and events
(walking, explosion), collectively called multijects.
To make the system learn and infer, it is pointed out
that detection of certain multijects may boost or
reduce the chances of detecting certain other
multijects. For example the detection of “sky” and
“water” boosts the chances of “beach” and reduces
the chances of detecting “indoors”. An important
observation from this interaction is that it might be
possible to infer some concepts, whose detection
may be difficult, based on their interaction with
other concepts whose detection may be relatively
easier. Other authors propose a model for
predicting words associated with whole images and
specific image regions, and a
way to discover and measure statistical relationships
among concepts from images and corresponding text
annotations. To estimate spoken words and identify
the speaker in an audio clip, Hidden Markov Models
(HMM) are the most widely used and produce the
best performance. Recent contributions in
statistical learning have provided new learning
methods. Taskar et al. propose a new framework,
the Maximum Margin Markov Network (M³N),
which integrates kernel methods, which maximize
the margin of confidence of the classifier, with
graphical methods, which can exploit the complex
structure of multimedia.
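
The boosting effect between multijects can be illustrated with a deliberately simplified sketch. It uses a naive Bayes update rather than the full Bayesian network of the cited work, and every probability value below is invented purely for illustration.

```python
def site_posterior(prior, likelihoods, evidence):
    """Naive Bayes posterior over site multijects given detector outputs.

    prior:       {site: P(site)}
    likelihoods: {site: {multiject: P(multiject detected | site)}}
    evidence:    {multiject: True if detected, False otherwise}
    """
    scores = {}
    for site, p in prior.items():
        for name, detected in evidence.items():
            p_detect = likelihoods[site][name]
            p *= p_detect if detected else (1.0 - p_detect)
        scores[site] = p
    total = sum(scores.values())
    return {site: score / total for site, score in scores.items()}


# Hypothetical numbers, chosen only to show the direction of the effect.
PRIOR = {"beach": 0.2, "indoors": 0.5, "street": 0.3}
LIKELIHOODS = {
    "beach":   {"sky": 0.90, "water": 0.80},
    "indoors": {"sky": 0.05, "water": 0.05},
    "street":  {"sky": 0.70, "water": 0.10},
}

if __name__ == "__main__":
    print(site_posterior(PRIOR, LIKELIHOODS, {"sky": True, "water": True}))
```

With these assumed values, detecting both "sky" and "water" raises the probability of "beach" well above its prior and pushes "indoors" close to zero, which is the qualitative behaviour described above.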





4.7 Database Model 


From the discussions of the previous sections we can 
now arrive at a model of the multimedia database.





Fig. 6 : Functional blocks of the database.


The feature extraction block X is primarily concerned
with the extraction of low-level features from the
media components kept at the MRP. It would comprise
a set of algorithms which would read the raw binary
data of media files and extract a set of pre-defined
features e.g. pixel values from image files, sample
values from audio files. The algorithms would be
different for different media types; even for the same
type the algorithms would vary depending on which
feature needs to be extracted. In general it could be
represented as a combination of different extraction
functions with weights:


$$X = w_1 X_1 + w_2 X_2 + w_3 X_3 + \ldots$$

where $X_1, X_2, X_3, \ldots$ are the functions for extraction of
specific features and $w_1, w_2, w_3, \ldots$ are their respective
weights which determine their importance or influence
in the process. In general these features would not be
utilized in their native form but would need to be
converted into another form for effective storage and
retrieval. There are two main reasons for
such conversion. First, considering all these features
might not be feasible as their numbers could be very
large e.g. an image of 1000 × 1000 pixels gives rise to
three million (R, G, B) values, and CD quality audio
contains 44100 samples every second. A conversion
algorithm would therefore need to transform the
original feature values into a more compact
representation. An example would be to segment an
image into a number of partitions and compute a
brightness histogram for each partition. The second
reason for the conversion is that the form in which the
features are extracted may not be suitable or efficient
for subsequent processing. For example pixel values in
an image are usually extracted as RGB triplets, but the
RGB color space not being perceptually uniform,
uniform quantization of the color space leads to
redundant bins and holes in smooth color variation.
Hence conversions to HSB or CIE-Lab color spaces
are recommended. The function C represents all these
types of conversions which might be necessary.
Typically the output from this stage would be a set of
multi-dimensional feature vectors that can be mapped
into a feature space before being compared. The next
step is the identification of semantic objects and is
done by function M. Here algorithms attempt to
interpret low-level features to derive meaningful
entities from them in manual, semi-automatic or
automatic ways. Fully automatic processes require a
priori information which could be inherent within
function M e.g. the known color of a scene background.
Manual and semi-automatic processes would require
additional semantic information, represented here by
m1, which would need to be inserted in a supervised
fashion e.g. a contour of an object in a scene. The next
step is the function I where the features are stored
using an appropriate index mechanism e.g. R-tree or
KD-tree. This would be required for efficient search
and retrieval functions. Although low-level content
forms the major portion of the indexed data, in some
cases high-level semantic content may also need to be
inserted, represented by m2. This step would be
optional and require intervention by a human content
expert. Typically such information would be
represented by text strings. An example would be the
insertion of the name and date of birth of a person in a
photograph. Such intervention would however
typically be in very small amounts. After all the low-level
features and semantic objects have been extracted
the final step would be to interpret their identities. For
this a knowledgebase K is inserted within the database
which would typically attempt to match the
information kept in I with previous information
inserted into it during a training phase, and infer the
presence of real-world entities within the media
objects. The output from K would typically be text
strings (labels) describing the interpreted objects along
with their locational information within a scene e.g. a
tree to the left of a road. Since each of these different
levels of interpretation has been derived from X, they
inherit the symbolic link to the corresponding media
file kept in the MRP. The searching function S would
attempt to find a set of media components which
satisfy the criteria in query q by considering both low-level
and semantic information.
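
The compaction performed by the conversion function C can be sketched for the brightness histogram example mentioned above. This is a minimal illustration under our own assumptions: the image is a nested list of (R, G, B) triplets in the range 0 to 255, brightness is taken as the plain mean of the three channels, and the names brightness and partition_histograms are hypothetical.

```python
def brightness(rgb):
    """Brightness of one pixel, taken here as the mean of R, G and B (an assumption)."""
    r, g, b = rgb
    return (r + g + b) / 3.0


def partition_histograms(image, rows, cols, bins=4):
    """Split the image into rows x cols partitions and return one
    brightness histogram (a list of `bins` counts) per partition."""
    height, width = len(image), len(image[0])
    histograms = {}
    for i in range(rows):
        for j in range(cols):
            hist = [0] * bins
            for y in range(i * height // rows, (i + 1) * height // rows):
                for x in range(j * width // cols, (j + 1) * width // cols):
                    level = brightness(image[y][x])
                    hist[min(int(level * bins / 256), bins - 1)] += 1
            histograms[(i, j)] = hist
    return histograms


if __name__ == "__main__":
    black, white = (0, 0, 0), (255, 255, 255)
    image = [[black, black, white, white] for _ in range(4)]
    print(partition_histograms(image, rows=1, cols=2, bins=2))
```

For a 1000 × 1000 image, a 4 × 4 partitioning with 8 bins would reduce three million raw values to 128 histogram counts, at the cost of discarding fine spatial detail.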





4.8 Proposed Layered Architecture 


To process information at various levels this paper
proposes a layered architecture for content based
retrieval functions. Each layer processes media
information at a specific level and hands over the
results to the next higher layer for a progressive
understanding of the media content. The bottom-most
layer is the Physical Layer which deals with
issues for representing the raw data of the original
media component. To illustrate the model we
consider a video clip since it is the most complex
media type and embodies other media types like
image and audio within it. A video segment can be
represented as f(x, y, t) in this layer, where the visual
and audio content is considered a function of two
locational coordinates x and y, and time t.

Fig. 7: Layered architecture of the CBR engine.


The next layer is the Features Layer which deals with
the functions for representing the features in an
appropriate feature space for effective clustering.
This results in the generation of the video segment
g(x, y, t) from the original representation due to a
transformation F.

$$g(x, y, t) = F\{f(x, y, t)\}$$

An example of such a conversion is the transformation
of pixel data from the RGB into the HSV color
space using previously proposed equations. This layer
also deals with extraction of low-level features which
leads to the decomposition of the media into a
collection of homogeneous regions $R_i$ corresponding
to perceptually uniform areas. The process is known
as region segmentation.

$$g(x, y, t) = \bigcup_{i=1}^{N_R} R_i$$

where $N_R$ is the number of regions in the video. Also
these regions in a specific frame n do not overlap. Hence

$$R_i(n) \cap R_j(n) = \emptyset, \quad \text{if } i \neq j$$
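
As a small illustration of a transformation F in the Features Layer, the sketch below converts one frame from RGB to HSV. It relies on the conversion implemented by Python's standard colorsys module, which is an assumption on our part and not necessarily the set of equations referred to in the text; the function name to_hsv_frame is hypothetical.

```python
import colorsys


def to_hsv_frame(frame):
    """Apply an RGB -> HSV transformation to every pixel of one frame.

    frame: rows of (r, g, b) triplets with components in 0..255.
    Returns rows of (h, s, v) triplets with components in 0..1.
    """
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in frame
    ]


if __name__ == "__main__":
    frame = [[(255, 0, 0), (0, 255, 0)], [(128, 128, 128), (0, 0, 255)]]
    for row in to_hsv_frame(frame):
        print(row)
```

Applying the same function to each frame of a clip gives a frame-by-frame realisation of g(x, y, t) for this particular choice of F.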


The third layer, called the Object Layer, would deal
with the identification and representation of
semantic objects, a process known as semantic
segmentation. The semantic partitions are defined
through human abstractions; consequently the
definition of the partition depends on the tasks to be
performed. In general the partition cannot be
expressed through homogeneity criteria e.g. same
color, because the elements of such a partition do not
possess invariant properties. Additional semantic
information m can optionally be added to this layer
through human intervention to improve
identification of objects. The output of this layer
would be in the form of a collection of semantic
partitions or objects $S_j$:

$$g(x, y, t) = \bigcup_{j=1}^{N_S} S_j$$

where $N_S$ is the total number of semantic objects in the
video. Also these semantic objects in a specific frame n
do not overlap. Hence

$$S_i(n) \cap S_j(n) = \emptyset, \quad \text{if } i \neq j$$

A generic semantic object can be considered to be
divided into a number of regions such that

$$S_i = \bigcup_{j=1}^{N_i} R_{i,j}$$

where $R_{i,j}$ represents the j-th region of the i-th object, and
$N_i$ is the number of regions in that semantic object. The
situation is depicted in the figure below. These SOs can
be identified if for example the background is of a
known color, which is not present in any of the objects,
and noting that two objects cannot overlap.

Fig. 8 : Identifying objects and regions (panel labels: Regions; Semantic objects; Regions within objects).
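
The relationship between regions and semantic objects within one frame can be checked with a short sketch. Pixels are modeled as (x, y) tuples and regions as Python sets; the region names, the grouping into objects and the helper names is_partition and semantic_objects are hypothetical.

```python
from itertools import combinations


def is_partition(parts, universe):
    """True if the parts are pairwise disjoint and their union covers the universe."""
    parts = list(parts)
    disjoint = all(a.isdisjoint(b) for a, b in combinations(parts, 2))
    covered = set().union(*parts) == set(universe)
    return disjoint and covered


def semantic_objects(regions, grouping):
    """Build each semantic object as the union of its member regions."""
    return {
        name: set().union(*(regions[r] for r in members))
        for name, members in grouping.items()
    }


# A 4 x 2 pixel frame split into three homogeneous regions (illustrative data).
FRAME = {(x, y) for x in range(4) for y in range(2)}
REGIONS = {
    "R1": {(x, y) for x in range(2) for y in range(2)},
    "R2": {(2, 0), (2, 1)},
    "R3": {(3, 0), (3, 1)},
}
GROUPING = {"background": ["R1"], "ball": ["R2", "R3"]}
OBJECTS = semantic_objects(REGIONS, GROUPING)

if __name__ == "__main__":
    print(is_partition(REGIONS.values(), FRAME), is_partition(OBJECTS.values(), FRAME))
```

Both the regions and the semantic objects partition the frame, while each semantic object is itself a union of one or more regions.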

It is to be noted that the concept of SOs is not restricted
to visual objects only; it can be applied to audio as well.
SOs in an audio clip can be a person speaking, a train
leaving a station or a musical composition, while regions
can comprise short sounds having an overall
homogeneous characteristic like a laugh, scream,
gunshot, whistle, breaking glass or a single stroke of a
violin.


The topmost layer, called the Interpretation Layer,
would include a semantic knowledgebase for proper
interpretation of all the information. Typically it would
accept information about SOs from the Object Layer and
some additional information from a previously trained
knowledgebase. It would then attempt to match
characteristics in order to derive an interpretation of
the SO. Usually the interpretation engine would be based
on a probabilistic framework like a Hidden Markov
Model (HMM). First an HMM would be trained based on
the feature vectors of a specific media type, which would
constitute the knowledgebase. This involves attaching
semantic labels to a time sequence of low level features
e.g. man walking, bird flying, indoor, outdoor, sunrise etc.
The training set is labeled independently for audio and
video components using the HTK Toolkit. The Viterbi
algorithm is then used to find the best possible state
sequence given the trained HMMs and the feature vectors.
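
The decoding step can be illustrated with a compact Viterbi sketch for a discrete HMM. The model below is not trained; its states, observation symbols and probabilities are hypothetical values set by hand for illustration, and the feature vectors are assumed to have been quantized into a small set of symbols beforehand.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        previous = best[-1]
        column = {}
        for s in states:
            prob, back = max((previous[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], back)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hand-set illustrative model: two site labels, three quantized brightness symbols.
STATES = ("indoor", "outdoor")
START = {"indoor": 0.6, "outdoor": 0.4}
TRANS = {"indoor": {"indoor": 0.7, "outdoor": 0.3},
         "outdoor": {"indoor": 0.4, "outdoor": 0.6}}
EMIT = {"indoor": {"dim": 0.5, "mixed": 0.4, "bright": 0.1},
        "outdoor": {"dim": 0.1, "mixed": 0.3, "bright": 0.6}}

if __name__ == "__main__":
    print(viterbi(["dim", "mixed", "bright"], STATES, START, TRANS, EMIT))
```

For longer sequences a practical implementation would work with log probabilities to avoid numerical underflow.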


The optimal state sequences found by the Viterbi
algorithm in the video and audio are then used as inputs to
a supervisor HMM, which fuses the modalities. Its
observations are the states of the media HMMs. Since the
observation rates of the video and audio are not the same,
the states of the media HMMs are sampled at a fixed rate to
produce the observation sequence for the supervisor
HMM. The supervisor HMM is then trained, after which it,
along with the media HMMs, emits the probability of the
occurrence of a media object. The effectiveness of such a
model has been reported in the literature.
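
The fixed-rate sampling that aligns the two media state sequences can be sketched as follows. The rates, labels and function names are hypothetical, and the sketch covers only the construction of the supervisor's observation sequence, not the training of the supervisor HMM itself.

```python
def sample_states(states, rate_hz, sample_hz, duration_s):
    """Resample a state sequence produced at rate_hz onto a common clock of sample_hz."""
    samples = []
    for k in range(int(duration_s * sample_hz)):
        index = min(int(k * rate_hz / sample_hz), len(states) - 1)
        samples.append(states[index])
    return samples


def supervisor_observations(video_states, video_hz, audio_states, audio_hz,
                            sample_hz, duration_s):
    """Pair the resampled video and audio states into joint observations."""
    video = sample_states(video_states, video_hz, sample_hz, duration_s)
    audio = sample_states(audio_states, audio_hz, sample_hz, duration_s)
    return list(zip(video, audio))


if __name__ == "__main__":
    video = ["walk", "walk", "run", "run"]                  # 2 states per second
    audio = ["steps"] * 4 + ["shout"] * 4                   # 4 states per second
    print(supervisor_observations(video, 2, audio, 4, sample_hz=4, duration_s=2))
```

Each pair in the result is one observation for the supervisor HMM, so both modalities contribute exactly one symbol per tick of the common clock.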


5. Playback and Synchronization 


The presentation is played back on the PBM by
retrieving actual media content from the MRP and the
spatiotemporal parameters and event handlers,
together referred to as authoring parameters, from the
MMP. Important issues which need to be considered
here are the delay factor, jitter and skew for effective
synchronization of parallel media streams. Two or
more media files retrieved separately from the MRP
may need to be played together e.g. an audio and a video,
or a sequence of images along with textual
descriptions. The delay factor usually refers to the time
taken for the objects to travel from the source to the
destination, but for interactive applications it also
includes the time for the user interaction to travel from
the destination to the source. Due to variations in delay
factors, temporal relationships between media
components can deviate from their desired values. This
is referred to as jitter and quantitatively refers to the
instantaneous difference between the desired
presentation times and actual presentation times. Skew
is the average jitter value over a period of time. The
major requirements for audio/video synchronization in
multimedia systems are bounded jitter within a
continuous stream, and minimal and acceptable skews
among dependent streams. Steinmetz's experiments
showed that for audio/video synchronization skews
less than 80 ms are ideal, between 80 and 160 ms are
acceptable and beyond 160 ms are considered
annoying. If the MRP is resident in the same local
system where the presentation is played back we can
ignore the skew produced in traveling from the MRP to
the PBM and assume that the media are available at the PBM
instantaneously. Synchronization could then be
imposed by the MMP script in two ways. The first way
is by ensuring that the data elements begin playback at
the same instant within the presentation and assuming
that they would be able to maintain synchronization
after that. This is suitable for situations where
synchronization requirements are not very rigid i.e.
relatively large skews may be allowable. This
restriction can be implemented by the script e.g. the
<par> and <seq> elements of SMIL for parallel
and sequential playback. The second way is by placing
pre-defined markers in the media files and then
forcing the playback head to cross the corresponding
markers in different files at the same point in time. In
this case the markers could be placed by the developer
in the MRP and the MMP script should be able to
recognize them. If a set of media components needs to
play together in parallel maintaining synchronization,
as the playback proceeds to the first marker of one of
the files, the script should be able to pause the playback
of this file until the playback of other files proceeds to
their corresponding first markers; then the playback of
all files again proceeds together. In a way the second
method can be considered as a repetitive application of
the first method over a number of points within each
file. This method is suitable for situations where
synchronization requirements are strict i.e. where large
skews are not allowed. An example is the recognition
of cue points by Lingo.
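
The marker-based method can be illustrated with a small simulation. It is a sketch under our own assumptions: each stream is described only by the times at which it would reach its markers if played freely, and the function name synchronize_at_markers is hypothetical. It does not model any particular scripting language.

```python
def synchronize_at_markers(marker_times):
    """Simulate marker-based synchronization of parallel streams.

    marker_times maps a stream name to the times (in seconds) at which the
    stream would reach markers 1, 2, ... if it played without pausing.
    Returns one (release_time, waits) pair per marker, where waits gives
    how long each stream is paused at that marker.
    """
    streams = list(marker_times)
    n_markers = min(len(times) for times in marker_times.values())
    previous_free = {s: 0.0 for s in streams}  # free-running time of the last marker
    release = 0.0                              # wall-clock time of the last release
    schedule = []
    for k in range(n_markers):
        arrivals = {}
        for s in streams:
            segment = marker_times[s][k] - previous_free[s]
            arrivals[s] = release + segment
            previous_free[s] = marker_times[s][k]
        release = max(arrivals.values())       # everyone waits for the slowest stream
        waits = {s: release - arrivals[s] for s in streams}
        schedule.append((release, waits))
    return schedule


if __name__ == "__main__":
    times = {"audio": [1.0, 2.1, 3.1], "video": [1.05, 2.05, 3.15]}
    for release, waits in synchronize_at_markers(times):
        print(round(release, 3), {s: round(w, 3) for s, w in waits.items()})
```

Because all streams restart together at every marker, the deviation between them is bounded by what accumulates within a single segment rather than over the whole clip.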






In a distributed network environment the MRP and PBM
would be located on different systems. Maintaining
synchronization in such cases is more complex than in
local environments due to the distributed storage of
synchronization information and different locations of
media objects. There are two approaches: (1) a best
effort environment with adaptive protocols and (2) a
reservation based environment. In the first case
networks and OS maximize the throughput and do not
provide QoS guarantees e.g. the Internet with
UNIX OS support at end-points. Adaptive
transmission protocols like VDP, vat and RTP/IP are
deployed on top of this environment. In the second
case the network and OS use reservation, admission
and enforcement algorithms to provide end-to-end QoS
guarantees e.g. an ATM network with QoS-aware
resource management in the OS kernel. Examples of
protocols in this type of system include the native ATM
transport protocol in QualMan and RSVP/IP. A
number of different techniques have been
proposed for multimedia synchronization to satisfy
diverse requirements, which implement
synchronization control at various locations at the
source and destination. One proposal suggests two
different schemes for inter-stream synchronization: the
synchronization marker for indication of
synchronization points and the synchronization
channel over which the markers are transmitted.
Escobar's scheme takes into account the dynamic
changes in network delays by monitoring the jitter and
re-synchronizing at the receiver if necessary. However
it assumes the presence of a global clock in a
distributed environment at all times. A similar approach
is taken in Rothermel's scheme. Another approach is to
send feedback, as in Ramanathan's scheme. The
playback devices provide feedback messages to the
server, based on which the server can estimate the
earliest and latest times by which playback needs to be
initiated. In order to determine temporal relationships
between streams, relative time stamps are assigned to
each media unit. Rangan discusses the continuity and
synchronization issues in MPEG compressed video
streams. The choice of the actual scheme depends on
the system requirements and limitations. According to
Nahrstedt's studies, adaptive synchronization has a
low demand on underlying resource management and
the call establishment phase is simple and fast; however
this protocol needs to work very hard during the
transmission phase in order to balance the load and
adapt to load variations. Oscillation of
synchronization skews in the desirable range (-80, 80)
ms and acceptable range (-160, 160) ms is observed.
On the other hand, reservation based
synchronization has a high demand on the underlying
resource management as it provides differentiation
over network resources. The call establishment phase
is complex, but once the QoS connections are
established, the synchronization protocol is very
simple. The only issue is to start the playback in a
synchronized fashion and coordinate reservations
between media types. The achieved performance is
better and the skews are bounded in the range of (-10,
10) ms.
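
The definitions of jitter and skew used in this section, together with the 80 ms and 160 ms thresholds quoted from Steinmetz's experiments, can be sketched as follows. The function names and the sample timings are our own illustrative assumptions, and the treatment of the exact boundary values is a choice made for this sketch.

```python
from statistics import mean


def jitter(desired_ms, actual_ms):
    """Instantaneous differences between actual and desired presentation times."""
    return [actual - desired for desired, actual in zip(desired_ms, actual_ms)]


def skew(desired_ms, actual_ms):
    """Skew as the average jitter over the observed period."""
    return mean(jitter(desired_ms, actual_ms))


def classify_skew(skew_ms):
    """Map a skew value to the perceptual categories quoted in the text."""
    magnitude = abs(skew_ms)
    if magnitude < 80:
        return "ideal"
    if magnitude <= 160:
        return "acceptable"
    return "annoying"


if __name__ == "__main__":
    desired = [0, 40, 80, 120]      # scheduled presentation times in ms
    actual = [10, 70, 130, 190]     # observed presentation times in ms
    value = skew(desired, actual)
    print(jitter(desired, actual), value, classify_skew(value))
```

Under this classification the (-10, 10) ms bound reported for reservation based synchronization falls well inside the ideal range, while adaptive synchronization oscillates between the ideal and acceptable ranges.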








6. Conclusions and Future Scope 


This paper proposes a framework for converting a
multimedia presentation from its present day concept
of displaying a set of fixed information to something
which can be dynamically adjusted to suit the
requirements of individual users. By representing the
presentation as a set of meaningful objects, each
associated with its own spatio-temporal parameters, it
is possible to not only populate it with different content
at different points in time by searching and selecting
from a repository of media elements, but also to update
the contents quickly and without much effort. It can
also lead to improved performance over low-bandwidth
channels as objects which remain the same
over subsequent frames need only be transmitted
once, while other objects can be transmitted at
different frequencies based on their variation patterns
within the presentation. Another advantage of the
scheme is its compatibility with emerging
standards. Object based compression schemes like
MPEG-4 can be applied to compress each object in a
presentation separately, the compression rate being a
function of its importance in the presentation. This
would reduce the overall bit-rate while preserving
perceived quality. Standard description schemes like
MPEG-7 can be applied to the CBR module.
Moreover using a scripting representation like SMIL,
presentations could be generated and redistributed
using template based designs. Ready-made
customized templates would help to cut down
authoring time drastically. Schemes like MPEG-21
can be used to provide IPR protection through its
Rights Expression Language and Rights Data
Dictionary. The inclusion of a knowledgebase adds
scalability to the architecture as it provides
opportunities to incorporate new learning mechanisms
and interpretation schemes, while the media
repository makes it easy to add new media
components. Moreover the separation of the content
from the layout means that each could be optimized
independently for maximum efficiency. The paper
also discusses each of the modules in the light of
existing and on-going research activities to
corroborate the feasibility of its practical implementation.